DATA1220-55, Fall 2024
2024-08-21
Correlation
Source: Figure 1.1 in https://r4ds.hadley.nz/intro.html
Chatfield, Chris (1996) Problem Solving: A Statistician’s Guide, 2nd ed.
RStudio Default Screen
Projects are a convenient way to keep all your files for an analysis in one place.
Go to File > New Project to begin one now. Call the project “homework1” and save it to your computer in a folder for this class.
R script
Quarto documents
End in .qmd and use markdown language to turn characters into formatted text.
Processes code in code chunks, and output appears directly in the document
Begin a new Quarto document now (File > New File > Quarto Document).
Your project now has its own “environment” in which you can store your data, variables and results.
Add a code chunk to your document, copy the code below, and run it.
Example:
Stored variable now appears in the environment
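The example code from the slide is not preserved in this text, so the following is only a minimal stand-in. It assumes a hypothetical variable named `class_size`; after the chunk runs, that name should appear in the Environment pane.

```r
# Hypothetical name and value, chosen only for illustration.
class_size <- 25

# Typing the name prints the stored value.
class_size

# Stored variables can be reused in later calculations.
class_size * 2

# ls() lists everything currently stored in the environment.
ls()
```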
Packages are collections of functions to use for statistical analyses. Some are loaded automatically, and some need to be separately installed. Let’s install the tidyverse package.
Package Install & Update Panel
Either…
or…
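The two options shown on the slide are not preserved here. As a sketch of the console route (the Install button in the Packages panel does the same job), a package is installed once per computer and then loaded in each new session. The `repos` value is an assumption that the CRAN cloud mirror is reachable from your machine.

```r
# Install only if the package is not already available (needs internet access).
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse", repos = "https://cloud.r-project.org")
}

# Load the package; this has to be repeated in every new R session.
library(tidyverse)
```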
Search for functions, packages, vignettes, and more directly in RStudio in the “Help” panel.
Help Panel
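The same searches can be run from the console. A short sketch using base R help functions, with `mean` picked only as a familiar example topic:

```r
# Open the help page for a function whose name you know.
?mean
help("mean")

# Search installed documentation by keyword when you do not know the name.
help.search("standard deviation")

# List objects on the search path whose names contain "mean".
apropos("mean")

# List the long-form guides (vignettes) shipped with installed packages.
vignette()
```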
The Palmer Penguins
Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
Data structure in rows and columns like a spreadsheet
Rows: (ideally) uniquely identified observations
Columns: variables that describe the observations
How many rows and columns does penguins have?
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
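The two printouts above look like the output of `head()` and `glimpse()`. Assuming that is how they were produced, and that the palmerpenguins and dplyr packages are installed, a sketch to reproduce them:

```r
# Assumes install.packages("palmerpenguins") has been run once.
library(palmerpenguins)
library(dplyr)

# First six rows, as in the tibble printout.
head(penguins)

# One line per column with its type and first few values.
glimpse(penguins)

# Row and column counts on their own.
nrow(penguins)  # 344
ncol(penguins)  # 8
```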
Use the Hmisc::describe() function to quickly summarize data.
penguins
8 Variables 344 Observations
--------------------------------------------------------------------------------
species
n missing distinct
344 0 3
Value Adelie Chinstrap Gentoo
Frequency 152 68 124
Proportion 0.442 0.198 0.360
--------------------------------------------------------------------------------
island
n missing distinct
344 0 3
Value Biscoe Dream Torgersen
Frequency 168 124 52
Proportion 0.488 0.360 0.151
--------------------------------------------------------------------------------
bill_length_mm
n missing distinct Info Mean Gmd .05 .10
342 2 164 1 43.92 6.274 35.70 36.60
.25 .50 .75 .90 .95
39.23 44.45 48.50 50.80 51.99
lowest : 32.1 33.1 33.5 34 34.1, highest: 55.1 55.8 55.9 58 59.6
--------------------------------------------------------------------------------
bill_depth_mm
n missing distinct Info Mean Gmd .05 .10
342 2 80 1 17.15 2.267 13.9 14.3
.25 .50 .75 .90 .95
15.6 17.3 18.7 19.5 20.0
lowest : 13.1 13.2 13.3 13.4 13.5, highest: 20.7 20.8 21.1 21.2 21.5
--------------------------------------------------------------------------------
flipper_length_mm
n missing distinct Info Mean Gmd .05 .10
342 2 55 0.999 200.9 16.03 181.0 185.0
.25 .50 .75 .90 .95
190.0 197.0 213.0 220.9 225.0
lowest : 172 174 176 178 179, highest: 226 228 229 230 231
--------------------------------------------------------------------------------
body_mass_g
n missing distinct Info Mean Gmd .05 .10
342 2 94 1 4202 911.8 3150 3300
.25 .50 .75 .90 .95
3550 4050 4750 5400 5650
lowest : 2700 2850 2900 2925 2975, highest: 5850 5950 6000 6050 6300
--------------------------------------------------------------------------------
sex
n missing distinct
333 11 2
Value female male
Frequency 165 168
Proportion 0.495 0.505
--------------------------------------------------------------------------------
year
n missing distinct Info Mean Gmd
344 0 3 0.888 2008 0.8919
Value 2007 2008 2009
Frequency 110 114 120
Proportion 0.320 0.331 0.349
For the frequency table, variable is rounded to the nearest 0
--------------------------------------------------------------------------------
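A summary like the one above can be produced with a single call. This sketch assumes the Hmisc and palmerpenguins packages are installed; `penguin_summary` is a hypothetical name used only for illustration.

```r
# Assumes the Hmisc and palmerpenguins packages are installed.
library(palmerpenguins)

# pkg::fun() calls a function without attaching the whole package.
penguin_summary <- Hmisc::describe(penguins)
penguin_summary

# Base R alternative that needs no extra package.
summary(penguins)
```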
Meet the Palmer Penguins
The key distinction we’ll make is between quantitative and qualitative variables.
Information that is quantitative describes a quantity.
Continuous variables (can take any value in a range) vs. Discrete variables (limited set of potential values)
Qualitative variables consist of names of categories.
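One way to see these types in R is to check how each column is stored. A rough sketch, assuming palmerpenguins is installed. Factor columns are treated here as qualitative and numeric columns as quantitative, which is a convention rather than a rule.

```r
library(palmerpenguins)

# Storage class of each column: factor, numeric (double), or integer.
sapply(penguins, class)

# Continuous example: bill length is stored as a double.
is.double(penguins$bill_length_mm)   # TRUE

# Discrete example: year takes only a few whole-number values.
sort(unique(penguins$year))          # 2007 2008 2009

# Qualitative example: species is a set of category names.
levels(penguins$species)             # "Adelie" "Chinstrap" "Gentoo"
```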
| name | description |
|---|---|
| species | Penguin species: Adelie, Chinstrap, Gentoo |
| island | Island where the penguin was observed |
| bill_length_mm | Bill length from base to tip (mm) |
| bill_depth_mm | Bill depth from bottom to top (mm) |
| flipper_length_mm | Flipper length (mm) |
| body_mass_g | Body mass (g) |
| sex | Male or female |
| year | Study year: 2007, 2008, 2009 |
Histogram (quantitative variables) or bar plot (qualitative variables)
Density, violin plot
Boxplot
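The plot types listed above could be drawn with ggplot2, which is loaded as part of the tidyverse. This is a sketch only: the choice of `body_mass_g` and `species`, the bin count, and the object names (`plot_data`, `p_hist`, and so on) are assumptions made for illustration.

```r
library(palmerpenguins)
library(ggplot2)

# Drop rows with a missing body mass so ggplot2 does not warn about them.
plot_data <- subset(penguins, !is.na(body_mass_g))

# Histogram: counts of penguins within bins of body mass.
p_hist <- ggplot(plot_data, aes(x = body_mass_g)) +
  geom_histogram(bins = 30)

# Density plot: a smoothed version of the histogram.
p_density <- ggplot(plot_data, aes(x = body_mass_g)) +
  geom_density()

# Violin plot: one mirrored density per species.
p_violin <- ggplot(plot_data, aes(x = species, y = body_mass_g)) +
  geom_violin()

# Boxplot: median, quartiles, and outlying points per species.
p_box <- ggplot(plot_data, aes(x = species, y = body_mass_g)) +
  geom_boxplot()

# Printing a plot object draws it.
p_hist
p_density
p_violin
p_box
```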
Develop a research question
Examine summary statistics
Data exploration
How to use Quarto in RStudio: https://quarto.org/docs/get-started/hello/rstudio.html
Markdown language basics: https://quarto.org/docs/authoring/markdown-basics.html
Themes for projects: https://quarto.org/docs/output-formats/html-themes.html
https://canvas.jcu.edu/courses/36290 | DATA1220 Class 02 | 2024-08-21 | https://campuswire.com/c/G6427C531/feed